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1. Summary 

(1) Phylogcnetic diversity (PD) depends on sampling intensity, which com- 
plicates the comparison of PD between samples of different depth. One 
approach to dealing with differing sample depth for a given diversity statis- 
tic is to rarefy, which means to take a random subset of a given size of the 
original sample. Exact analytical formulae for the mean and variance of 
species richness under rarefaction have existed for some time but no such 
solution exists for PD. 

(2) We have derived exact formulae for the mean and variance of PD under 
rarefaction. We show that these formulae are correct by comparing exact 
solution mean and variance to that calculated by repeated random (Monte 
Carlo) subsampling of a dataset of stem counts of woody shrubs of Toohey 
Forest, Queensland, Australia. We also demonstrate the application of the 
method using two examples: identifying hotspots of mammalian diversity in 
Australasian ecoregions, and characterising the human vaginal microbiomc. 

(3) There is a very high degree of correspondence between the analytical and 
random subsampling methods for calculating mean and variance of PD un- 
der rarefaction, although the Monte Carlo method requires a large number 
of random draws to converge on the exact solution for the variance. 

(4) Rarefaction of mammalian PD of ecoregions in Australasia to a common 
standard of 25 species reveals very different rank orderings of ecoregions, 
indicating quite different hotspots of diversity than those obtained for un- 
rarefied PD. The application of these methods to the vaginal microbiome 
shows that a classical score used to quantify bacterial vaginosis is correlated 
with the shape of the rarefaction curve. 

(5) The analytical formulae for the mean and variance of PD under rarefaction 
are both exact and more efficient than repeated subsampling. Rarefaction of 
PD allows for many applications where comparisons of samples of different 
depth is required. 

2. Keywords 

alpha diversity; exact formulas; phylogenetics; sampling depth 
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3. Introduction 



Phylogcnctic Diversity (PD), the total branch length of a tree, has been ex- 
tensively used as a measure of biodiversity. Originally conceived of as a method 



for prioritising regions for conservation (Faith 1992), PD has seen wider use in 



other applications such as biogeography (Davies and Buckley 2011), macroecol- 



ogy (Meynard et al. 



2008). 



2011) and microbial ecology (Lozupone and Knight 
This increasing breadth of application can be attributed to a number of desirable 
properties including: 1) explicitly addressing the non-equivalence of species in their 
contribution to overall diversity; 2) acting as a surrogate for other aspects of diver- 



sity such as functional diversity (Cadotte et al. 2009); 3) incorporating information 



on the evolutionary history of communities and biotas; and 4) being robust to prob- 
lems of species delineation because the relationships between populations and even 
individuals can be represented by relative branch lengths without the need to es- 



tablish absolute species identity. Further, the original simple formulation of Faith 
( 1992 ) has been built on to produce a broader "PD calculus" measuring such as- 

iRosauer et al. 



pects of diversity as endemism (Faith et al. 2004 



(Allen et al. 2009) and resemblance (Faith et al. 



2009), evenness 



2009; Nippcress et al. 2010). 



Phylogenetic diversity increases with increasing sampling effort just like many 
other measures of biodiversity. Thus, the comparison of the phylogenetic diversity 
of communities is not straightforward when sample sizes differ, as is common with 
real data sets. Unless data are standardised in some sense to account for differences 
in sample size or effort, the relative diversity of communities can be profoundly 



misinterpreted (Gotelli and Colwell 2001). 



The established solution to the problem of interpreting diversity estimates with 
samples of varying size is rarefaction. The rarefaction of a given sample of size 
n to a level k is simply the uniform random choice of k of the n observations 
(typically without replacement) . The observations are typically of either individual 
organisms or collections of organisms, giving either individual-based or sample- 
based rarefaction curves ( |Gotelli and Colwell| 2001). To consider a given measure 
of diversity under rarefaction, the measure of diversity is simply applied to the 
rarefied sample. Researchers are typically interested in the expectation and/or 
variance of a measure of diversity under rarefaction. 

Rarefaction curves can be used to understand the depth of sampling of a com- 
munity compared to its total diversity. Additionally, rarefaction curves capture 



information about evenness (Olszewski 20041 and beta-diversity (Crist and Veech 



2006), depending on whether observations are of individuals or collections. Rarefac- 



tion curves have been computed for phylogenetic diversity ( 


Lozupone and Knight 


2008 


Turnbaugh et al. 


2008 Caporaso et al. 


2012 


Yu et al. 


2012 


). In each of 



these cases, rarefaction was not by counts of individual organisms or collections of 
such, but was instead based on counts of unique sequences or Operational Taxo- 
nomic Units. Rarefaction by such units, including taxonomic species, makes sense 
in the context of phylogenetic diversity where it might not with other measures of 
biodiversity. In effect, with these examples, rarefaction is by the tips of the tree 
and the resulting curve gives an indication of tree shape and distribution of sample 
observations amongst the tips of the tree. 

One way to obtain summary statistics such as expectation and variance under 
rarefaction is to compute these statistics on samples drawn using a Monte Carlo 
procedure, that is, calculate the desired statistics on a collection of random draws. 
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There are closed form solutions for the mean of many measures of biodiversity 
under rarefaction. For example, an analytical solution is well-known for species 
diversity, can be calculated for rarefaction by individuals and samples, and is much 



more efficient than resampling ( jHurlbert 1971 Ugland et al. 2003 Chiarucci et al 



2008). However, we are not aware of such a formula for any phylogenetic diversity 
metrics. 

In this paper, we establish analytical formulae for the mean and variance of 
phylogenetic diversity under rarefaction. We develop these formulas in the setting 
of a phylogenetic tree with "marks" , which are a simple generalization allowing 
multiplicity of observations and arbitrary positions of observations along the tree. 



4. Materials and methods 

There are two different notions of the induced phylogenetic diversity (PD) of a 
subset K of the leaves L of a tree T\ these notions have been called unrooted and 



rooted PD (Pardi and Goldman 2007). Unrooted PD is the total branch length of 
the smallest unrooted subtree contained in T that has all of the leaves in K. Rooted 
PD is the total branch length of the smallest rooted tree containing the original 
root of T as well as the selected leaves K. The rooted definition was that originally 
intended by Faith (1992): see (Faith 2006) for a historical discussion. These two 



need not be the same: for example, any K consisting of a single element will have 
zero unrooted phylogenetic diversity, but nonzero rooted phylogenetic diversity. 

These two definitions are useful in different domains of application. For example, 
for conservation applications keeping a single species has significant value, thus it 
makes sense to have nonzero PD for a single species. On the other hand when 
comparing ecological diversity between environments, it may not make sense to 
keep the root, in which case the diversity between the members of a set of size one 
is zero. 

We will derive formulae for both definitions of PD. However, the description of 
the variance of unrooted PD will be deferred to the Appendix. 

Formulae for rarefaction of phylogenetic diversity can be easily and productively 
generalized from the notion of tree to the notion of a tree with marks, which allows 
more flexibility in abundance weighting and attachment locations. We define a tree 
with marks as a tree along with a collection of special points on the tree, which 
may be present with multiplicity. The induced subtree of a collection of points on 
a phylogenetic tree is the smallest connected set that contains all of those points. 
The phylogenetic diversity of a (sub)tree is the total branch length of the tree. 

In this setting, marks represent observations, thus if a certain leaf taxon t is 
observed x times, x marks are put at t. However, it is just as easy to generalize to 
the setting where marks appear at internal nodes of the tree. The motivation for 
working in terms of marks is that it formalizes the notion of observation count and 
affords some extra flexibility for location of observations. In particular, microbial 
ecologists often census a given community by high-throughput sequencing, and it 
is not practical to build a phylogenetic tree on all of the sequences thus created. 
For this reason, scientists sometimes map sequences to trees using either similarity 



search plus a most recent common ancestor strategy, as in the work of Huson et al. 



(2007 


), or "place" the sequences 


et al. 


2011 


Matsen et al. 


2010) 



The attachment point of a mapping of a sequence 



into a tree can be considered as a mark. 
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The unrooted phylogenetic diversity of a tree with marks is the total branch length 
of the tree induced by those marks, that is, the total branch length of the smallest 
connected subset of the tree containing the marks. The rooted phylogenetic diversity 
of a rooted tree with marks is equal to the unrooted phylogenetic diversity of the 
tree with the given marks along with a mark added at the root; in this case the path 
from the root to the selected leaves is always included in the PD calculation. These 
are simple generalizations of the corresponding definitions for leaf observations. 

The following sections will be concerned with rarefying the collection of marks 
and computing phylogenetic diversities of the corresponding induced subtrees. We 
will use distal to mean the direction towards the leaves, while proximal is the 
direction towards the root. If T is unrooted, we will still use these terms for 
descriptive purposes; in this case an arbitrary root can be assigned. 

We fix a non-empty collection M of n marks on a tree T, and some number 
1 < k < n of marks to sample for our rarefaction. Again, marks can be present 
multiple times in a collection, enabling the expression of multiplicity of observation 
of a taxon or sequence. 

Definition 1. Define an edge snip to be a maximal segment of an edge with no 
marks or internal nodes. 

Say there are s snips on the tree with marks, and let li be their length for 
1 < i < s. Let d be the set of marks that are proximal to snip i, and Dj be the 
set that are distal to snip i. 

Definition 2. For every I < i < s, let X\ be the random variable that is equal to 
one if there is at least one mark on the distal side of snip i after rarefaction, and 
zero otherwise. Let Xf be the random variable that is equal to one if there is at 
least one mark on each side of snip i after rarefaction, and zero otherwise. 

The following two statements are true for X e {X r ,X U } with the corresponding 
Y e {Y r ,Y u }. The phylogenetic diversity Y after rarefaction can be expressed as 
the random variable 



because the length of a snip i contributes to the PD exactly when the corresponding 



To calculate expectations and covariances of the X iy the following definition will 
be useful. Fix an R C M. Let qk{R) be the probability that nothing in R is selected 
in a uniform sample of size k from M without replacement. Recalling that n = \M\, 
note that 



(1) 





1 otherwise 

with the understanding that (q) = 1 for all x € N. 

Note that the qk(R) can be calculated for successive k by observing that 




n-\R\-k 
q k +i(R) = r — Qk(R)- 
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Because the q k only depend on the size of R, a computer implementation only needs 
to calculate the q k (R) once for any R of a given size; the q k (R) notation was chosen 
for convenience. 

4.1. Rooted phylogenetic diversity. As described above, rooted phylogenetic 
diversity does PD calculation while always including the root. By (JlJ and ^ all 
that is needed is the mean and the covariance matrix of the X['s. Note that X\ is 
zero unless at least one element of Di is sampled, in which case it is one. Thus 

(3) E[XT]=l-9fc(A). 

X[ Xj is zero unless the rarefaction samples at least one element of both Di and 
Dj, in which case it is one. The probability that one or both of these are empty 
under rarefaction is q k (D{) + q k {Dj) — q k (Di U Dj), thus 



E[XTXV] = 1 - q k {Di) - q k {D 3 ) + q k {Di U Dj). 

E[Xr]E[XJ] = 1 - q k {Di) - q k {Dj) + q k (Di)q k (Dj), 
Cov(Xr,Xj) = q k (D t U Dj) - q k {Di)q k (Dj). 

E[Y r }=J2ti [l-?fc(A)] 

i 

Var[y] =J2 £ i £ j [«*(A U Dj) - %(A)&(A)] • 

hi 

This solution can be seen to be a generalisation of the analytical formulae for the 
mean and variance of expected species richness under rarefaction ( Hurlbertj |1971| 



By §, 
thus, 

In summary. 



Heck Jr et al. 1975). Consider the special case of a "star" tree with all tips sharing 
a single common ancestor, where all marks are located at the tips of the tree (with 
the exception of one mark placed at the root), and where all branch lengths (and 
thus all snips) have a length of one. Under these particular circumstances, the 
species richness and phylogenetic diversity of the collection of marks are equal and 
the formulae for mean and variance of expected phylogenetic diversity simplify to 
their equivalents for species richness. 

4.2. Unrooted phylogenetic diversity. Assume as above that we are sampling 
k > marks for our rarefaction. By the definition of a snip, it is not possible for 
the rarefaction samples from both Ci and Di to be empty. Thus these two events 
are mutually exclusive, and 

(4) E[X?] = 1 - q k (d) - q k (Di). 
Then, by and @, 

(5) E(Y u )=Y / ^ll-1k(C l )-q k (D l )}. 



The variance of the unrooted case is deferred to the Appendix. 
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4.3. Example applications. We demonstrate our method for calculating the mean 
and variance of phylogenetic diversity under rarefaction using three examples. In 
the first, we compare the rarefaction curve generated by Monte Carlo randomisa- 
tion to that calculated by the exact analytical solution. The data are counts of 
stems of all woody shrubs in forty plots in Toohey Forest, Queensland, Australia. 
Within each plot, all plant stems over 0.3 m and under 3.0 m were counted; this 
figure was used as an index of abundance. All shrubs were identified to species and 



a composite phylogeny was compiled from multiple published trees; see (Nipperess 
et all 120101) for a more detailed description of the data. Stem counts were summed 



across all plots to produce a single value per species before rarefaction by individual 
stems. Of the total of 582 stems, rarefied values were calculated for every multiple 
of 10 stems from 10 to 580. For the Monte Carlo procedure, mean and variance 
of phylogenetic diversity were calculated from 2,000 random subsamples of size k 
from the pool of 582 stems. 

Our second example demonstrates rarefaction of phylogenetic diversity by units 
of species. Phylogenetic diversity of extant mammals was calculated for each terres- 
trial ecoregion on the Australian continental shelf (that is, Australia along with Tas- 
mania, New Guinea and offshore islands). Terrestrial ecoregions are biogeographic 
units representing distinct species assemblages (Olson et al. 2001). Species lists of 
mammals for each ecoregion were sourced from the WildFinder database maintained 
by the World Wildlife Fund ( |http: / /www.worldwildlife.org/science/wildfinder/ [). 
Evolutionary relationships were sourced from a species-level supertree of the world's 



mammals ( Bininda-Emonds et al. 2007). Because of the strong correlation between 



species richness and phylogenetic diversity, rarefaction allows for the comparison of 
ecoregions with the effect of spatial variation in species richness removed. To do 
this, the expected phylogenetic diversity for a subset of 25 mammal species was cal- 
culated for each ecoregion. The value of 25 was chosen because it was the minimum 
species richness for this set of ecoregions. 

Our third example comes from the human microbiome. We reanalyze a py- 
rosequencing dataset describing bacterial communities from women with bacterial 
vaginosis (BV) . BV has previously been shown to be associated with increased mi- 
crobial community diversity (Fredricks et al. 2005). For this study, swabs were 



taken from 242 women from the Public Health, Seattle and King County Sexu- 
ally Transmitted Diseases Clinic between September 2006 and June 2010 of which 



220 samples resulted in enough material to analyze (Srinivasan et al. 2012) (data 



available as Sequence Read Archive submission SRA051298). Vaginal fluid for each 
specimen was also evaluated according to Nugent score, which provides a diag- 
nostic score for BV which ranges from (BV-negative) to 10 (BV-positive) based 
on presence and absence of bacterial morphotypes as viewed under a microscope 
(Nugent et al. 1991). Selection of reference sequences and sequence preprocessing 
were performed using the methods described by Srinivasan et al. (2012). 452,358 
reads passed quality filtering, with a median of 1,779 reads per sample (range: 523- 
2,366). For this application, we investigated the shape of the rarefaction curves with 
respect to resampling. 
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Figure 1. Comparison of analytical value (curve) with Monte 
Carlo calculation with 2,000 samples (points) for the mean of 
rooted PD under rarefaction. 




Rarefaction size (k) 



Figure 2. Comparison of analytical value (curve) with Monte 
Carlo calculation with 2,000 samples (points) for the variance of 
rooted PD under rarefaction. 

5. Results 

There was a very high degree of correspondence between the analytical and 
Monte Carlo methods for the expected value and variance of phylogenetic diversity 
of the Toohey Forest dataset under rarefaction (Fig. [I|and[2| corresponding results 



8 



for unrooted PD not shown). In this application the Monte Carlo estimate of the 
PD variance does not converge quickly to the exact value, as can be seen from the 
deviations of the points (generated from 2,000 Monte Carlo samples) from the line 
in Fig. [2j Such slow convergence provides further motivation for an exact formula. 

Correcting phylogenetic diversity for the number of species present made a large 
difference to the ranking of terrestrial ecoregions in terms of their diversity (Fig. [3]). 
With unrarefied phylogenetic diversity, the three highest ranked ecoregions (South- 
eastern Papuan rainforests, Southern New Guinea lowland rainforests, Central 
range montane rainforests) are found in New Guinea. However, when variation 
in species richness is taken into account by rarefaction, two of the three highest 
ranked ecoregions (Australian Alps montane grasslands, Naracoorte woodlands) 
were in temperate Australia. Thus the rarefied version demonstrates high phyloge- 
netic diversity for this data set relative to the number of species present for those 
regions. 




Figure 3. Phylogenetic diversity of mammal faunas for terres- 
trial ecoregions on the Australian continental shelf. Phylogenetic 
diversity is calculated for (a) all species present and (b) as an 
expected value after rarefaction to 25 species. Ecoregions are 
coloured light blue for low values to dark red for high values. The 
three highest ranked ecoregions in each case are indicated by num- 
ber. 



The rarefaction curves for the vaginal samples shows a connection between the 
Nugent score of the sample and the shape of the curve (Fig. |4| . The rarefaction 
curves for low Nugent score samples tend to start low and stay low. The high 
Nugent score samples typically start higher than low Nugent score samples, and 
stay high. 

6. Discussion 

We have presented exact formulae for the mean and variance of rooted and un- 
rooted phylogenetic diversity under rarefaction. This solution gives results that are 
indistinguishable from those given by Monte Carlo randomisation. The analytical 
method is preferred both because its results are exact (rather than estimated by 
subsampling) and can be more efficient than sampling. 

Rarefaction of phylogenetic diversity is seeing growing use in a variety of bio- 
logical disciplines and we highlight two specific applications here. Rarefaction of 



Rarefaction Size 



Figure 4. Rarefaction curve of samples from (Srinivasan et al. 



2012|. The Nugent score is a diagnostic score for bacterial vagi- 



nosis, with being "normal" and 10 being classified as BV. 



phylogenetic diversity by units of species allows for the assessment of phylogenetic 
diversity independently of species richness. Removing the influence of species rich- 
ness can allow for the fairer comparison of evolutionary histories of faunas and 
floras. For example, Davies and Buckley (2011) investigated the evolutionary his- 
tory of mammal faunas by mapping the residuals of a LOESS regression between 
phylogenetic diversity and species richness for 100 x 100 km grid cells across the 
globe. However, while certainly better than fitting a linear model, such a cor- 
rection is unnecessary when the relationship between phylogenetic diversity and 
species richness can be described exactly by a rarefaction curve. 

The rarefaction curves for the vaginal samples give interesting information about 
the distribution of phylotypes in the vaginal microbiome. Some of this information 
recapitulates prior knowledge. For example, samples with low Nugent score are 
typically dominated by a handful of bacterial species in the Lactobacillus genus. 
These rarefied curves start low and stay low. If there are also other distantly- 
related organisms present, but in low abundance, the curve can start low and then 
curve up to a high level. The high Nugent score samples, that tend to start high 
and increase rapidly, indicate that there are a considerable number of taxa spread 
across the tree that appear in the samples with reasonably high count. 

Software implementing the exact analytical solution for rarefaction of phyloge- 
netic diversity are already available. The phylorare and phylocurve functions are 
implemented in the R statistical environment (R Development Core Team 2010). 
These functions calculate mean rooted phylogenetic diversity and can be used to 
standardise a set of samples to a particular level of sampling effort (phylorare) 
or generate a rarefaction curve for units of individuals, collections or species (phy- 
locurve). These functions can b e downloaded from|http://david nipperess.blogspot xom.au/| 
The pplacer suite of programs (http://matsen.fhcrc.org/pplacer/) is a collection of 
programs for "phylogenetic placement" and associated analyses. The guppy soft- 
ware is the main binary to perform downstream analysis of collections of placements. 
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It calculates phylogenetic diversity as well as a number of other phylogenetic diver- 
sity measures, including balance-weighted phylogenetic diversity (McCoy and Mat- 
sen, in preparation), phylogenetic entropy (Allen et al. 2009), and phylogenetic 



quadratic entropy (|RaoJ [1982J) . It also calculates PD rarefaction curves as shown 
here, as well as those for phylogenetic quadratic entropy. 

Future work will include sensitivity of the PD rarefaction curve to tree shape 
and the distribution of individuals among species. 
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8. Appendix 

Here we will calculate the variance of the unrooted phylogcnctic diversity Y u via 
First note that (X") 2 = Xf because Xf only takes the values 1 and 0. Thus 
if i = j then Cov(X?,XJ) = E[X 4 U ] - E[X?] 2 which can be calculated using @. 

Now assume i ^ j. Instead of expressing the variance in terms of functions of 
Cj 's and D^s, we will express them in terms of the following quantities defined for 
a pair of edges i and j. Hi is proximal to j, let Sij be Di (which is the Same side 
of i as j), Oij be (which is on the Other side of i from j), Sjj be Cj, and Oj.i 
be Dj. If j is proximal to i, the roles of i and j are reversed. If the path from i 
to j traverses the root, then let Sij be C,*, Ojj be S^j be Cj, and O^i be ZX,. 
Since Oij and Si j are just Cj and Di in some order, we can think of Xf being 
the random variable equal to one if the rarefaction sample from both Oij and Sij 
are nonempty and zero otherwise; the corresponding definition is true for X™. By 
these definitions, a key point is that O^j C Sj t i and Oj.i C Sij. 

To calculate Cov(X? ,Xf) = E[X?X%} - E[X?]E[XJ], note that X?XV is zero 
unless the rarefaction samples at least one element of both Oij and Oj^ 1 in which 
case it is one. The probability that one or both of these are empty under rarefaction 
is qk(Oi,j) + qk{Oj,i) ~ Qk(Oi,j U O j:i ), thus (for i ^ j) 

E[X?XJ] = 1 - q k (O itj ) - q k (O j4 ) + q k (O itj U O iti ). 

By @, 

E[X?}E[Xf] = [1 - q k (O itj ) - q k (Sij)][l - q k {0^ - q k (S jti )\ 

= 1 - qk(Oi,j) - q k (O jti ) + qk(Oi,j)q k (Oj t i) 

- [1 - qk(Oij)]q k {S jt i) ~ q k (Sij)[l - q k {O hi )\ + q k (S it j)q k (Sj,i). 

Thus (again, for i ^ j), 

Cov(X?,X}) = q k {O id U O iti ) - q k (0. hj )q k {0^i) + [1 - q k (O h3 )]q k (S hl ) 

+ q k (Sij)[l - q k (0 Jti )} - q k (Sij)q k {Sj,i). 

These expressions can then be substituted back into (|2| to get an expression for 
the variance of phylogenetic diversity. 



